@andishgar
Contributor

@andishgar andishgar commented Feb 19, 2025

Rationale for this change

While the extract_regex function returns substrings of the matching regex captures, extract_regex_span returns (index, length) pairs of these substrings relative to the original string values.
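
To illustrate the difference described above, here is a minimal sketch using Python's standard `re` module on byte strings. It is not the Arrow implementation: the two helper functions only borrow the names of the compute functions, and the partial-match behaviour and the dict-of-tuples output shape are assumptions made for illustration.

```python
import re

# Hypothetical pattern with two named capture groups.
PATTERN = rb"(?P<key>\w+)=(?P<val>\w+)"


def extract_regex(value: bytes, pattern: bytes):
    """Return the captured substrings by group name, or None if there is no match."""
    m = re.search(pattern, value)
    if m is None:
        return None
    return {name: m.group(name) for name in m.re.groupindex}


def extract_regex_span(value: bytes, pattern: bytes):
    """Return (index, length) byte spans by group name, or None if there is no match."""
    m = re.search(pattern, value)
    if m is None:
        return None
    return {
        name: (m.start(name), m.end(name) - m.start(name))
        for name in m.re.groupindex
    }


value = b"  color=red;"
substrings = extract_regex(value, PATTERN)
spans = extract_regex_span(value, PATTERN)

# Slicing the original value with each span recovers the matching substring.
for name, (index, length) in spans.items():
    assert value[index:index + length] == substrings[name]
```

Spans are useful when the caller needs to locate the capture inside the original value (for example, to slice or annotate it) rather than copy the captured bytes.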

Are these changes tested?

Yes, by dedicated unit tests.

Are there any user-facing changes?

No, except a new compute function.

@andishgar
Contributor Author

@pitrou could you review this issue?

Member

@mapleFU mapleFU left a comment

Would the compute document also need to add this?

@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Feb 22, 2025
@andishgar
Contributor Author

> Would the compute document also need to add this?

Yes, this is a new compute function called extract_regex_span. Should I add changes somewhere?

Member

@pitrou pitrou left a comment

Thanks for this, @arashandishgar. The overall implementation is sound, but there are a number of things to improve in this PR; see the comments below.

@andishgar
Contributor Author

andishgar commented Feb 24, 2025

Appreciate your feedback! I’ll look into it and follow up soon.

@andishgar
Contributor Author

@pitrou I've applied all the changes you suggested. Could you review again?

@andishgar andishgar requested a review from pitrou February 28, 2025 09:54
@assignUser assignUser removed request for kou and raulcd February 28, 2025 18:25
@assignUser assignUser removed their request for review February 28, 2025 18:25
Member

@pitrou pitrou left a comment

Thanks for the update. Here are some more comments.

Also, can you add some documentation to https://github.com/apache/arrow/blob/main/docs/source/cpp/compute.rst#string-component-extraction?

@andishgar
Contributor Author

Thanks for your review. I'll look into it.

@andishgar
Contributor Author

andishgar commented Mar 7, 2025

> Thanks for the update. Here are some more comments.
>
> Also, can you add some documentation to https://github.com/apache/arrow/blob/main/docs/source/cpp/compute.rst#string-component-extraction?

Thanks for mentioning it; I wrote the relevant documentation.

Member

@pitrou pitrou left a comment

Thanks for the update, @arashandishgar. Here are a couple more comments; can you take a look and address them?

@andishgar
Contributor Author

Thanks for your comment. I will look into it.

@pitrou pitrou force-pushed the extract_regex_span branch from 2764dc9 to 703a5ac March 11, 2025 15:50
Member

@pitrou pitrou left a comment

I've made some minor changes and added Python bindings. I'll merge this PR if CI is green, thank you @arashandishgar !

@pitrou
Member

pitrou commented Mar 11, 2025

@github-actions crossbow submit -g cpp

@github-actions

Revision: 703a5ac

Submitted crossbow builds: ursacomputing/crossbow @ actions-4358c6f365

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-meson GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

@pitrou pitrou changed the title from "GH-44615: [C++] Add possibility to extract spans/byte offsets directly for compute.extract_regex" to "GH-44615: [C++][Compute] Add extract_regex_span function" Mar 11, 2025
@pitrou pitrou merged commit 0494115 into apache:main Mar 11, 2025
40 checks passed
@pitrou pitrou removed the "awaiting committer review" label Mar 11, 2025
@andishgar
Contributor Author

> I've made some minor changes and added Python bindings. I'll merge this PR if CI is green, thank you @arashandishgar !

I wanted to take a moment to sincerely thank you for your patience and constructive feedback during the code review process. As this was my first experience contributing to Apache Arrow, I learned a lot throughout the review, and your thoughtful suggestions were extremely valuable.

I realized that I made a mistake by not properly generalizing your suggestion across the entire codebase, especially in my second commit. In my rush to submit the pull request, I overlooked that point.

I truly appreciate your understanding, and I hope to be more diligent in applying feedback consistently in future contributions.

Thanks again for your time and support.

@pitrou
Member

pitrou commented Mar 11, 2025

You're welcome, @arashandishgar. Feel free to contribute again!

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 0494115.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.
